The Lag Effects of Air Pollutants and Meteorological Factors on COVID-19 Infection Transmission and Severity: Using Machine Learning Techniques

Background: Exposure to air pollution is a major health problem worldwide. This study aimed to investigate the effects of air pollutant levels and meteorological parameters, together with their lag times, on the transmission and severity of coronavirus disease 19 (COVID-19) using machine learning (ML) techniques in Shiraz, Iran. Study Design: An ecological study. Methods: In this ecological research, three main ML techniques, namely decision trees, random forest, and extreme gradient boosting (XGBoost), were applied to relate meteorological parameters and air pollutants to infection transmission, hospitalization, and death due to COVID-19 from 1 October 2020 to 1 March 2022. These parameters and pollutants included particulate matter (PM2.5), sulfur dioxide (SO2), nitrogen dioxide (NO2), nitric oxide (NO), ozone (O3), carbon monoxide (CO), temperature (T), relative humidity (RH), dew point (DP), air pressure (AP), and wind speed (WS). Results: Based on the three ML techniques, NO2 (lag 5 days), CO (lag 4 days), and T (lag 25 days) were the most important environmental features affecting the spread of COVID-19 infection. In addition, the most important features contributing to hospitalization due to COVID-19 were RH (lag 28 days), T (lag 11 days), and O3 (lag 10 days). After adjusting for the number of infections, the most important features affecting the number of deaths caused by COVID-19 were NO2 (lag 20 days), O3 (lag 22 days), and NO (lag 23 days). Conclusion: Our findings suggest that, in epidemics caused by COVID-19 and (possibly) similarly transmitted viral infections, including flu, air pollutants and meteorological parameters can be used to predict the burden on the community and health system. In addition, meteorological and air quality data should be included in preventive measures.

sociodemographic factors, living conditions, and environmental factors, including pollutants. 7,8 In this regard, several researchers hypothesized that the disease could spread more rapidly under certain environmental conditions. 9,10 Studies suggest that exposure to air pollutants increases the risk of transmission of viral infections. 11,12 For example, the transmission of SARS-CoV-2 infection is associated with certain levels of air pollutants, air temperature (T), and relative humidity (RH). 10,1 Several studies suggest that certain infectious agents, including coronaviruses, may survive longer in polluted air 14,15 and that exposure to air pollutants, especially particulate matter (PM) and toxic gases, can weaken the body's respiratory defense by damaging the lining of the respiratory tract, making it easier for viruses to penetrate and establish infection. 15,16 In addition, air pollution can trigger inflammation in the respiratory system and suppress the function of the immune system, making individuals more susceptible to severe infections. For instance, evidence indicates that exposure to air pollutants may increase the risk of hospitalization and death among individuals infected with COVID-19. 12,15 To complicate matters further, meteorological factors seem not only to affect the level of air pollutants but also to alter the effect of pollutants on the body's functions. 17,18 Although several studies have been conducted on the role of environmental factors in the incidence and mortality of COVID-19 infection during the pandemic, very few have, given the long list of meteorological factors and air pollutants, shed light in one modeling framework on the time interval between changes in the levels of air pollutants and meteorological parameters and the incidence of COVID-19 infection and its severity (i.e., hospitalization and death). 19
Machine learning (ML) methods have been extensively used to detect and predict various diseases. 20 Therefore, the current study was conducted to define the relative importance of meteorological and pollution parameters in predicting the infection and mortality of COVID-19 using ML techniques and to investigate the lag time effects of air quality factors on the outcome indexes. The selected air parameters included T, RH, dew point (DP), air pressure (AP), and wind speed (WS), as well as the concentrations of PM2.5, SO2, NO2, NO, O3, and CO.

Setting
This ecological study was conducted from October 1, 2020, to March 1, 2022, in Shiraz. This period was selected because the fewest changes were observed in the application of the major control strategies, including vaccination, diagnosis, community restrictions, and the like. Shiraz is the capital of Fars Province, located in the south of Iran. Its population is about 1 955 500 people, making Shiraz the fourth largest and most populous city in Iran. The average annual T of the city is about 17.8 °C, with a maximum T of 43.2 °C in July and a minimum T below freezing in January. The average annual rainfall is about 283.9 mm, and the average height of this city is 1486 meters above sea level.

Air pollution data
The data on air pollutants, including CO, O3, NO2, SO2, PM2.5, and NO, were collected from the Environmental Organization of Fars Province. The organization measures the selected pollutants on an hourly basis. After the data were screened, the outliers and missing values of the various pollutants were identified and imputed by the random forest (RF) method. Then, the 24-hour average concentration was calculated for each pollutant and used in this study.
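The aggregation from hourly measurements to 24-hour averages can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `daily_means` and the NO2 readings are hypothetical, and missing readings are simply skipped here, whereas the study imputed them with an RF beforehand.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean


def daily_means(hourly_readings):
    """Average hourly (timestamp, value) pairs into one value per calendar day.

    Readings equal to None are skipped. This is a simplification: the study
    imputed missing values with a random forest, which is not reproduced here.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly_readings:
        if value is not None:
            by_day[timestamp.date()].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly NO2 readings, for illustration only.
readings = [
    (datetime(2020, 10, 1, 0), 30.0),
    (datetime(2020, 10, 1, 1), 34.0),
    (datetime(2020, 10, 1, 2), None),
    (datetime(2020, 10, 2, 0), 20.0),
    (datetime(2020, 10, 2, 1), 22.0),
]
no2_daily = daily_means(readings)
```

Each key of `no2_daily` is a calendar day and each value is that day's mean concentration, which is the form in which a pollutant series would enter a daily lag analysis.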

Meteorological parameters
The meteorological data from October 1, 2020, to March 1, 2022, used in the present study included AP, T, DP, RH, and WS. The data are available at https://www.wunderground.com.

Coronavirus disease 19 data
The required COVID-19 data (i.e., the daily number of positive polymerase chain reaction tests and the daily number of hospitalizations and deaths due to COVID-19) were obtained from the COVID-DASHBOARD and MCMC (Medical Care Monitoring Center) databases of Shiraz University of Medical Sciences. The reported COVID-19 data are updated daily.

Statistical analysis
After checking the normality assumption, the Pearson correlation coefficient was used to measure the lag time effect of independent variables, namely, environmental (O3, NO2, SO2, PM2.5, NO, and CO) and meteorological (AP, T, DP, RH, and WS) parameters, on the number of cases, number of hospitalizations, and number of deaths due to COVID-19. A relationship was considered statistically significant at P < 0.05. Statistical analysis was performed using STATA software, version 17.
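To make the lag-time correlation concrete, the sketch below correlates the outcome on day t with the exposure on day t minus the lag, for every lag in a range, and then picks the lag with the largest absolute coefficient. It is an illustration under stated assumptions, not the authors' STATA procedure: the function names and the synthetic series are hypothetical, and significance testing is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def lagged_correlations(exposure, outcome, max_lag=30):
    """Correlate outcome on day t with exposure on day t - lag, per lag."""
    results = {}
    for lag in range(max_lag + 1):
        shifted_exposure = exposure[: len(exposure) - lag]
        aligned_outcome = outcome[lag:]
        results[lag] = pearson_r(shifted_exposure, aligned_outcome)
    return results


def strongest_lag(correlations):
    """Lag whose correlation has the largest absolute value."""
    return max(correlations, key=lambda lag: abs(correlations[lag]))


# Synthetic example: the outcome follows the exposure with a 3-day delay.
exposure = [(7 * day) % 11 for day in range(60)]
outcome = [0.0] * 3 + [2 * value + 1 for value in exposure[:-3]]
correlations = lagged_correlations(exposure, outcome, max_lag=10)
best_lag = strongest_lag(correlations)
```

In this synthetic case the delay built into `outcome` is recovered as `best_lag`. With real data the strongest lag is only a descriptive summary and inherits the ecological limitations of the design.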
Lag times with the strongest correlation were selected for inclusion in the ML procedure to define the relative importance of the parameters. The relative importance of the air parameters was defined using an ML approach (a well-known approach that plays a crucial role in the epidemiology of diseases by providing more accuracy in description, prediction, and decision-making in advanced epidemiological studies). 21 Three modeling methods (i.e., decision tree [DT], RF, and extreme gradient boosting [XGBoost]) were employed to define and compare three lists of the relative importance of the study parameters in the incidence and severity of COVID-19 infection. 22,23 In general, DTs are basic, widely used models that are prone to overfitting. RF is an ensemble method that reduces overfitting by averaging predictions, and XGBoost is a powerful gradient-boosting algorithm that sequentially improves model performance by correcting errors. Each algorithm has its strengths and is suitable for different aspects of our study questions. All methods were used to define the relative importance of the study parameters and the majority voting index.
Among several measures of model performance (e.g., R-squared [R²], mean absolute error, mean squared error, and root mean square error) available in similar areas of research, 24 R² was used to compare the performance of the models because it appears to be a more informative metric for comparison. 25 The models were trained and validated using Python, version 3.8.
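For reference, the four performance measures named above can be computed as in the following sketch. The function name and the observed and predicted values are hypothetical and serve only to show how the metrics relate to one another; the study reports R² alone.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    ss_residual = sum(e * e for e in errors)
    # R-squared: share of the outcome variance explained by the predictions.
    r2 = 1 - ss_residual / ss_total
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical daily case counts and model predictions.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
metrics = regression_metrics(observed, predicted)
```

Unlike the error-based measures, R² is unit-free, which makes it convenient for comparing models across outcomes with very different scales such as cases, hospitalizations, and deaths.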

Machine learning approach
In this study, feature selection techniques based on ML methods were employed to identify the various environmental and meteorological parameters that affect the number of cases, hospitalizations, and deaths due to COVID-19. Assuming the usual time lag from virus transmission to apparent infection (maximum = 14 days), infection to hospitalization (maximum = 10 days), and death (maximum = 10 days), 26 the lag times of the effects of daily air pollutants and meteorological parameters on COVID-19 infection and death were set between 0 and 30 days. Moreover, considering the important effect of the number of positive cases diagnosed on the number of hospitalizations and the number of deaths, this variable was entered into the model.
The operational flow of the ML algorithm is depicted in Figure 1.
1. The datasets of air pollutants (PM2.5, CO, NO2, NO, SO2, and O3) and meteorological parameters (AP, air T, DP, RH, and WS) were used as predictors, and the number of cases, number of hospitalizations, and number of deaths due to COVID-19 were employed as response variables.
2. To preprocess the dataset, the missing values of various pollutants were imputed by the RF method. The imputed dataset was split into 80% as train data and 20% as test data.
3. In this study, several ML techniques (i.e., DTs, RF, and XGBoost) were utilized to analyze the data via modeling and computing the features' importance and the lag time of their effects. These techniques are described in more detail in the following sections:

Decision tree
DT is a non-linear algorithm. Each parameter is represented by a node in the tree, and the values of that parameter are represented by the respective branches of the node. The DT divides the input based on the values of the various parameters. The design of a DT classifier is influenced by the way the tree is structured, the feature subsets used at each internal node, and the decision criteria utilized at each node. Some of the criteria for the design of the tree structure include error rates, the number of nodes in the tree, and information gain. Entropy and information-based techniques are generally employed as a decision rule at each node, whereas branch and bound techniques and greedy algorithms are applied for feature subset selection. 19
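How a tree node divides the input can be illustrated with a single split on one feature. The sketch below assumes a regression setting with the mean squared error as the split criterion, which matches the model statistic named later in the Methods but is not necessarily how the authors' library was configured. The function names and data are hypothetical.

```python
def mean_squared_error(values):
    """Mean squared deviation of the values from their own mean."""
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values) / len(values)


def best_split(feature, target):
    """Return (threshold, mse_reduction) for the best single split.

    Rows with feature <= threshold go left, the rest go right. The reduction
    is the parent MSE minus the size-weighted MSE of the two children.
    """
    n = len(target)
    parent_mse = mean_squared_error(target)
    best = (None, 0.0)
    # The largest value is excluded so the right child is never empty.
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        weighted = (
            len(left) * mean_squared_error(left)
            + len(right) * mean_squared_error(right)
        ) / n
        reduction = parent_mse - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Hypothetical data: cases jump when temperature exceeds 15 degrees.
temperature = [5, 10, 15, 20, 25, 30]
wind_speed = [3, 1, 2, 3, 1, 2]
cases = [100, 110, 105, 200, 210, 205]

temp_threshold, temp_gain = best_split(temperature, cases)
_, wind_gain = best_split(wind_speed, cases)
```

In this toy example temperature yields a far larger error reduction than wind speed, so a tree would split on temperature first. Summing such reductions over all nodes that use a feature is one common way tree-based models rank feature importance.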

Random forest
The RF is a supervised ensemble learning method that makes predictions based on DTs. This approach involves multiple trees, each trained independently, whose predictions are averaged to give the final result. One sample is drawn from the dataset for each tree needed to build the RF, and each tree is grown by selecting the best split among all the input variables at each node. Predictions are made by aggregating the predictions of all the trees. The method is highly flexible and fast and can be used for both classification and regression. Majority voting among predictions and the average of predictions are utilized for classification and regression, respectively. 19,27

Extreme gradient boosting
XGBoost is a successful ML method based on a gradient-boosting algorithm. It has better control over overfitting than many other algorithms because it uses a more regularized model formalization. A grid search on hyperparameters with 10-fold cross-validation was performed to find the best model based on the R² metric. 27
4. To train the model using the ML techniques, first, the tuning parameters for each model are chosen, and then resampling is performed using the cross-validation method.
5. The importance of each feature is computed based on the model statistic. A feature is considered important if a reduction in the model statistic is observed when that feature is added to the model. For DT, RF, and XGBoost, the mean squared error (MSE) is employed as the model statistic. The performance of each ML model was measured and compared using R². 24,28
6. The importance of the selected final factors was determined by majority voting with intermediate-level fusion approaches (≥ 66%) to integrate the results of the three mentioned ML techniques. 29 In majority voting, a group of diverse ML models is trained on the same dataset, each employing unique algorithms or techniques. When it is time to make a prediction, each model casts its vote for the outcome, and the final prediction is determined by the majority's decision. It is a useful technique for improving the accuracy and robustness of the final prediction, particularly when the individual models have different strengths and weaknesses, and their combination can lead to better overall performance.
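One way to read the majority-voting step is sketched below: each model nominates its most important features, and a feature is retained when at least 66% of the models nominate it. This is an interpretation offered for illustration only. The text does not spell out how the votes were tallied, and the function name and the per-model feature lists are hypothetical.

```python
from collections import Counter


def majority_vote(top_features_by_model, threshold=0.66):
    """Keep features nominated by at least `threshold` of the models.

    Returns a dict mapping each retained feature to its agreement share.
    """
    n_models = len(top_features_by_model)
    votes = Counter()
    for features in top_features_by_model.values():
        votes.update(set(features))  # one vote per model per feature
    return {
        feature: count / n_models
        for feature, count in votes.items()
        if count / n_models >= threshold
    }


# Hypothetical top-3 features per model, for illustration only.
nominations = {
    "DT": ["NO2", "CO", "T"],
    "RF": ["NO2", "T", "O3"],
    "XGBoost": ["NO2", "CO", "RH"],
}
selected = majority_vote(nominations)
```

With three models, the 66% threshold corresponds to agreement by at least two of them, so a feature backed by all three reaches full agreement (100%) and one backed by two reaches moderate agreement (66.7%).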

Results
In total, 235 766 cases, 29 894 hospitalizations, and 3229 deaths due to COVID-19 were registered with the COVID-DASHBOARD and MCMC registry in Shiraz during the study period (515 days). Correlations between daily air pollutants and meteorological factors with the number of cases, number of hospitalizations, and number of deaths due to COVID-19 are provided in Table 1.
Based on the results, the number of cases of COVID-19 had a positive and significant correlation with O3, NO2, PM2.5, T, DP, and WS. On the other hand, a negative correlation was observed between the number of cases of COVID-19 and NO, CO, humidity, and pressure. In addition, the number of hospitalizations due to COVID-19 had a positive correlation with the detected number of cases of COVID-19, O3, NO2, PM2.5, T, DP, and WS, while there was a negative correlation with NO, CO, SO2, humidity, and pressure. The number of deaths due to COVID-19 had a positive correlation with the detected number of cases of COVID-19, O3, NO2, CO, PM2.5, T, WS, and DP, while there was a negative correlation with NO, humidity, and pressure. For daily NO2, the highest correlation with the number of COVID-19 cases was found at lag 8 days (r = 0.63), with hospitalizations due to COVID-19 at lag 0 days (r = 0.39), and with deaths due to COVID-19 at lag 20 days (r = 0.73).
The importance of each air pollutant and meteorological factor, as computed by the various ML techniques, is presented in Tables 2-4, which also report the performance of the three ML methods in terms of R².
The most important features affecting COVID-19 infection transmission based on the three techniques were NO2 at lag 5 days, CO at lag 4 days, T at lag 25 days, and O3 at lag 10 days (Table 2). Further, based on the majority voting with full prediction agreement of all 3 models considered (100%), NO2 was reported to be the most important influencing factor in the transmission of the COVID-19 infection. The performance of the DT, RF, and XGBoost predictive models for the COVID-19 infection was R² = 0.56, R² = 0.64, and R² = 0.73, respectively (Table 2).
The most important features affecting the severity of the disease and hospitalization based on the three ML techniques were RH at a lag of 28 days, T at a lag of 11 days, and O3 at a lag of 10 days (Table 3). Furthermore, based on the majority voting with moderate prediction agreement (2 out of 3 models, 66.7%), humidity was the most important factor related to the severity of COVID-19. The performance of the DT, RF, and XGBoost predictive models for hospitalizations due to COVID-19 was R² = 0.68, R² = 0.79, and R² = 0.87, respectively (Table 3).
Moreover, the most important features affecting mortality due to COVID-19 found by the three ML techniques were NO2 at lag 20 days, O3 at lag 22 days, and NO at lag 23 days (Table 4). Based on the majority voting with moderate prediction agreement (2 out of 3 models, 66.7%), NO2 was the most important factor affecting mortality related to COVID-19. The performance of the DT, RF, and XGBoost predictive models for the number of deaths due to COVID-19 was R² = 0.53, R² = 0.63, and R² = 0.72, respectively (Table 4).

Discussion
Despite the large body of research on COVID-19, there are still huge gaps in understanding the mechanisms of infection transmission and the development of the disease.
Various studies have shown that air pollution can increase the transmissibility and severity of several airborne infections, including coronavirus. 15,30 The present study was conducted in Shiraz, the fourth largest city in Iran, to estimate the importance of major air pollutants (NO2, NO, O3, SO2, CO, and PM2.5) and meteorological parameters (T, RH, WS, DP, and AP) in the pattern of case detection, hospitalization, and death due to COVID-19 using ML techniques.
According to the results, XGBoost provided the highest performance when compared to DT and RF. Positive and significant relationships were found between exposure to NO2, O3, PM2.5, T, WS, and DP at particular lags and the number of cases of COVID-19 and hospitalizations. However, inverse associations were observed between NO, CO, RH, and AP levels and the number of new cases of COVID-19 and hospitalizations. In addition, positive associations were found between NO2, O3, PM2.5, T, WS, DP, and CO and the number of deaths caused by COVID-19. On the other hand, there was an inverse relationship between NO, RH, and AP and the number of deaths caused by COVID-19.
Regarding the contribution of NO2 to the risk of COVID-19 infection, hospitalization, and mortality, a study by Frontera et al also reported that NO2 may play an important role in the occurrence of severe forms of COVID-19. 31 NO2 is released into the environment due to incomplete combustion in fuel engines and any process involved in burning coal, oil, or natural gas. 32 In general, exposure to this pollutant is associated with an increased risk of mortality from respiratory and cardiac diseases. NO2 also causes alterations in the respiratory and immune functions, compromises the airflow in the airways, and ultimately facilitates complications of respiratory infections. 15 Similarly, other researchers found a significant association between NO2 levels in the air and mortality from COVID-19. For example, a study conducted in 12 cities in Iran reported a relationship between air pollution (including NO2) and death due to COVID-19. 33 Another study suggested that exposure to NO2 increases the mortality caused by COVID-19 more than exposure to other air pollutants. 34 In another study performed in 3 cities in Iran, NO2 was significantly associated with the mortality and severity of the COVID-19 infection. 6 Regarding other pollutants, a study in India demonstrated a significant rise in the number of COVID-19 cases following short-term exposure to PM2.5 and PM10, in addition to NO2. 35 Similarly, the findings of another study revealed that exposure to PM2.5 was significantly associated with an increased risk of COVID-19 infection. It was also found that exposure to PM2.5, PM10, SO2, NO2, O3, and CO was associated with death due to COVID-19. 36
In addition, a study conducted in Arak, Iran, indicated a positive and significant relationship between PM2.5, PM10, and SO2 and detected cases and deaths caused by COVID-19. Moreover, NO2, CO, O3, WS, and RH were inversely associated with hospitalization and mortality due to COVID-19. 37 According to the WHO, O3 in the atmosphere causes respiratory problems and asthma and reduces lung function. 38 In a study performed in India, PM2.5, PM10, SO2, and O3 were positively associated with the number of infections and deaths due to COVID-19. 38 Likewise, research in China showed that the number of COVID-19 cases was positively correlated with the levels of PM2.5, NO2, O3, and SO2. 39 Several studies reported an association between hospitalization and exposure to SO2. 37 Similarly, Liu et al in China demonstrated that the rate of spread of the COVID-19 infection increased with exposure to SO2. 40 However, Zhu et al found no association between exposure to SO2 and hospitalization. 39 It seems that SARS-CoV-2 can spread more widely in windy conditions because wind increases air circulation. 41 In the present study, a direct relationship was observed between average WS and mortality due to COVID-19. In contrast, a study conducted in Malaysia revealed that the cases of COVID-19 were negatively correlated with WS. 24 Airborne PM, especially PM2.5, may not only carry SARS-CoV-2 but also promote virus attachment and replication in the bronchi by damaging bronchial epithelial cells, ultimately leading to increased hospitalizations and death due to COVID-19. 42,43 Several studies evaluated the role of meteorological parameters in the transmission of COVID-19. 44,45 Accordingly, different environmental conditions were found to affect the spread of the infection by changing several parameters, including air conditioning and circulation. 45
Air T and humidity were the most important meteorological parameters affecting mortality caused by COVID-19. Air T could affect the spread of COVID-19 under different conditions. For example, high RH was reported to have an inhibitory effect on the transmission of COVID-19. 46 Likewise, according to a study performed in Ahvaz, high air T and high RH led to a significant decrease in the daily incidence of COVID-19. 13 Another study in the UK found that high T and long DPs could reduce the incidence of COVID-19 and its related mortality. On the other hand, WS significantly increased the above indexes. 41 In line with the results of the present study, some studies showed a positive and significant correlation between air T and the number of patients with COVID-19. 47 This may be because hot weather outside forces people to stay in closed environments without ventilation, and the risk of disease transmission increases as a result of insufficient ventilation in indoor spaces. 48,49 However, several studies reported an inverse correlation between air T and the spread and mortality of coronavirus. 50 It is also possible that, due to sunlight and UV radiation in the seasons with high T, environmental conditions are tougher for any infectious agent in the air, including SARS-CoV-2. 24,37,41
Regarding the studies discussed above, no lag time was examined when investigating the effect of air pollutants and meteorological factors on COVID-19 infection and mortality. At present, we are unable to interpret the lags between exposure to these air parameters and the COVID-19 infection. However, we believe that defining the lag time with an interval of several days, compared to the one-day model, provides a better understanding of the effects of air pollution on airborne infections, including COVID-19.

Strengths and Limitations
A range of major meteorological and air quality indexes were used to better understand the association and temporality between these factors and COVID-19 infection incidence and severity. However, the nature of the study design prevents us from making causal inferences from the results, as many unmeasured factors (e.g., fast-changing and hard-to-track behavioral factors in the study population regarding the transmission of infection, detection of infection, and hospitalization) are potentially confounding the associations under study.

Conclusions
In this study, several air pollutants and meteorological parameters were analyzed to measure the correlation between NO2, NO, O3, SO2, PM2.5, CO, air T, RH, WS, DP, and AP and the transmission and severity of the COVID-19 infection in Shiraz. The findings confirmed positive and strong (r > 0.5) associations of NO2 (lag 8 days) and O3 (lag 6 days) with case detection, and of NO2 (lag 20 days) and O3 (lag 23 days) with mortality of COVID-19 cases. Humidity (lag 26 and lag 30 days) and pressure (lag 30 and lag 29 days) were negatively and significantly associated with both case detection and death due to COVID-19. Our results suggested that air pollutants and meteorological parameters with particular lag times play important roles in the epidemiology of the disease. Thus, it is necessary to perform further studies to understand the mechanism of the delayed effect of such pollutants on the natural history of the COVID-19 infection. Regarding the application of the results, air quality and meteorological parameters should be used when studying such airborne epidemics and should be considered when developing and implementing primary and secondary preventive strategies. Specific measures, such as traffic management and specifically targeted partial community or social event lockdowns, can be taken to reduce air pollution and infection transmission in communities. Meteorological and air quality data can be integrated into public health policies to mitigate the impact of COVID-19 and other respiratory diseases.
• NO2 (lag 5 days), CO (lag 4 days), and temperature (lag 25 days) are the most important environmental features affecting the spread of the COVID-19 infection.

Table 1. The highest correlation between daily air pollutants and meteorological factors with the number of cases, number of hospitalizations, and number of deaths due to COVID-19 at different lag times (0-30 days)
Note. COVID-19: Coronavirus disease 19; NO2: Nitrogen dioxide; O3: Ozone; NO: Nitric oxide; SO2: Sulfur dioxide; CO: Carbon monoxide; PM: Particulate matter.

Table 2. The relative importance of air pollutants and meteorological factors on COVID-19 infection transmission based on machine learning techniques

Table 4. The relative importance of air pollutants and meteorological factors on mortality due to COVID-19 based on machine learning techniques
• The most important features affecting hospitalization due to COVID-19 are RH (lag 28 days), temperature (lag 11 days), and O3 (lag 10 days).
• The most important features affecting the number of deaths caused by COVID-19 are NO2 (lag 20 days), O3 (lag 22 days), and NO (lag 23 days).
• Air pollutants and meteorological parameters can be used to predict the disease burden of COVID-19 on the community and health system.